December, 2014

Overview

  • Definitions, motives, spectrum
  • Current practices
  • A selection of tools to improve reproducibility
  • Challenges, standards & our role in the future of reproducible research

Definitions

"The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified."
  • Max Kuhn, CRAN Task View: Reproducible Research

Empirical - Statistical - Computational

Stodden, V., et al. 2013. "Setting the default to reproducible in computational science research." SIAM News 46: 4-6.

Motivations: Claerbout's principle

"An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result."
  • Claerbout and Karrenbach, Proceedings of the 62nd Annual International Meeting of the Society of Exploration Geophysicists. 1992

"When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures."
  • Buckheit & Donoho, Wavelab and Reproducible Research, 1995.

Benefits are straightforward

  • Verification & Reliability: Easier to find and fix bugs. The results you produce today will be the same results you will produce tomorrow.
  • Transparency: Leads to increased citation counts, broader impact, improved institutional memory
  • Efficiency: Reuse allows for de-duplication of effort. Payoff in the (not so) long run
  • Flexibility: When you don't 'point-and-click' you gain many new analytic options.

But the limitations are substantial

Technical

  • Classified/sensitive/big data
  • Software licensing issues
  • Competition
  • Neither necessary nor sufficient for correctness (but essential for dispute resolution)

Cultural & personal

  • Very few researchers follow even minimal reproducibility standards.
  • No-one expects or requires reproducibility
  • No uniform standards of reproducibility, so no established user base
  • Inertia & embarrassment

Our work exists on a spectrum of reproducibility

Peng 2011, Science 334(6060) pp. 1226-1227

Goal is to expose the reader to more of the research workflow

Current practices in many disciplines

  • Enter data in Excel
  • Use Excel for data cleaning & descriptive statistics
  • Import data into SPSS/SAS/Stata for further analysis
  • Use point-and-click options to run statistical analyses
  • Copy & paste output to Word document, repeatedly

Click-trails compromise clarity

  • Lots of human effort for tedious & time-wasting tasks
  • Error-prone due to manual & ad hoc data handling (column and row offsets are common)
  • Difficult to record - hard to reconstruct a 'click history'
  • Tiny changes in data or method require extensive reworking efforts

Scripted analyses support scientific integrity

  • Plain text files will be readable for a long time
  • Improved transparency, automation, maintainability, accessibility, portability, efficiency, communicability of process (what more could we want?)
  • But there's a steep learning curve

What am I doing to encourage sharing?

Using literate statistical programming

The alternative to point-and-click analyses

"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do." – Donald E. Knuth, Literate Programming, 1984

For example… Let's calculate the current time in R.

time <- format(Sys.time(), "%a %d %b %X %Y")

The text and R code are interwoven in the output:

The time is `r time`

The time is Mon 15 Dec 12:55:42 AM 2014
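
To make the weaving step concrete, here is a toy sketch in base R of what happens to an inline `r ...` expression. The function name `weave_inline` is hypothetical and is not part of knitr or R Markdown, whose real implementation is far more general. It only shows the idea: evaluate the code, then splice the result into the sentence.

```r
# Toy "weave" step (illustration only, NOT how knitr is implemented):
# find each inline `r ...` expression, evaluate it, and splice the
# result back into the surrounding text.
weave_inline <- function(text, envir = parent.frame()) {
  pattern <- "`r ([^`]+)`"
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]
  for (hit in hits) {
    expr  <- sub(pattern, "\\1", hit)                # code between the backticks
    value <- eval(parse(text = expr), envir = envir) # run it
    text  <- sub(hit, format(value), text, fixed = TRUE)
  }
  text
}

time  <- format(Sys.time(), "%a %d %b %X %Y")
woven <- weave_inline("The time is `r time`")
print(woven)
```

Because the sentence is regenerated from code each time, the reported time can never drift out of sync with the computation that produced it.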

Using an open source programming language

The machine-readable part

R: Free, open source, cross-platform, highly interactive, huge user community in academia and the private sector

R packages: an ideal 'Compendium'?

"both a container for the different elements that make up the document and its computations (i.e. text, code, data, etc.), and as a means for distributing, managing and updating the collection… allow us to move from an era of advertisement to one where our scholarship itself is published" - Gentleman and Temple Lang 2004

Using an open document formatting language

Rmarkdown: lightweight document formatting syntax based on email text formatting. Easy to write, read and publish as-is.

The human-readable part

  • minor extensions to allow R code display and execution
  • embed images in html files
  • citations, captions, equations
  • insert LaTeX/HTML where needed

Using dynamic documents in R

  • Narrative and code in the same file or explicitly linked
  • When data or narrative are updated, the document is automatically updated
  • Data treated as 'read only'
  • Output treated as disposable
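
The last two points can be sketched in base R. The project layout, file names, and the `build_outputs` function below are hypothetical, chosen only for illustration: the raw file is read but never modified, and the output folder is deleted and rebuilt on every run.

```r
# Hypothetical layout: data/ is only ever read, output/ can be deleted at any time.
project <- file.path(tempdir(), "demo-project")
raw_dir <- file.path(project, "data")
out_dir <- file.path(project, "output")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in raw data (in a real project this file already exists).
raw_file <- file.path(raw_dir, "measurements.csv")
if (!file.exists(raw_file)) {
  write.csv(data.frame(id = 1:4, value = c(2.5, 3.0, NA, 4.5)),
            raw_file, row.names = FALSE)
  Sys.chmod(raw_file, mode = "0444")  # mark read-only where the OS supports it
}

build_outputs <- function() {
  unlink(out_dir, recursive = TRUE)   # output is disposable: always start fresh
  dir.create(out_dir)
  raw   <- read.csv(raw_file)
  clean <- raw[!is.na(raw$value), ]   # cleaning happens in code, not in the raw file
  write.csv(clean, file.path(out_dir, "clean.csv"), row.names = FALSE)
  mean(clean$value)
}

result <- build_outputs()
print(result)
```

Deleting the output folder costs nothing, because everything in it can be rebuilt from the raw data and the script.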

Using version control

Payoffs

  • Eases collaboration
  • Can track changes in any file type (ideally plain text), and who made them
  • Can revert a file to any point in its tracked history

Costs

  • Unfamiliar to most social scientists
  • Takes time to master

Using convenient tools, services & support

What can be done to improve code & data sharing?

Stodden (IASSIST 2010) sampled American academics registered at the Machine Learning conference NIPS (134 responses from 593 requests, 23%). Red = communitarian norms, Blue = private incentives.

Speed up culture change with incremental steps

  • Promote culture change through positive attribution
  • Implement mechanisms to indicate & encourage degrees of compliance (i.e. clear definitions for different levels of reproducibility), cf. Stodden's:
    • 'Reproducible': compendium of text-code-data online
    • 'Reproduced': compendium available and independently reproduced
    • 'Semi-Reproducible': when the full compendium is not released
    • 'Semi-Reproduced': independent reproduction with other data
    • 'Perpetually Reproducible': streaming data

Promote existing standards to normalise reproducible research

  • Schwab et al.: ER (Easily reproducible), CR (Conditionally reproducible), NR (Not reproducible)
  • Biostatistics kite-marking of articles (Peng 2009): D (data), C (code), R (both)
  • Reproducible Research Standard (Stodden 2009): we should
    • Release the full compendium on the internet
    • License media such as text, figures, tables with Creative Commons Attribution license (CC-BY)
    • License code with one of Apache 2.0, MIT, LGPL, BSD, etc.
    • License "selection and arrangement" of data with CC0 or CC-BY

Use the Center for Open Science's badges

An incentive to share data and code: open practices are acknowledged with badges in publications. Currently used by Psychological Science.

Integrate these values into everyday tasks

  • Train students by putting homework, assignments & dissertations on the reproducible research spectrum
  • Publish examples of reproducible research in our field
  • Request code & data when reviewing
  • Submit to & review for journals that support reproducible research
  • Critically review & audit data management plans in grant proposals
  • Consider reproducibility wherever possible in hiring, promotion & reference letters.

Thanks

"Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry."
  • Raymond, E. S., 2004, The Art of UNIX Programming. Addison-Wesley.

Colophon